+++
title = 'Reinforcement learning'
template = 'page-math.html'
+++
# Reinforcement learning

## What is reinforcement learning?
The agent is in a state and takes an action.
The action is selected by a policy: a function from states to actions.
The environment tells the agent its new state, and provides a reward (a number; higher is better).
The learner adapts the policy to maximise the expectation of future rewards.
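
A minimal sketch of this agent-environment loop, assuming a hypothetical `env` with `reset()`/`step(action)` methods and a `policy` function (neither is defined in the lecture):

```python
def run_episode(env, policy, max_steps=100):
    state = env.reset()                          # agent starts in some state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # policy: function from states to actions
        state, reward, done = env.step(action)   # environment returns new state and a reward
        total_reward += reward                   # learner maximises expected future reward
        if done:
            break
    return total_reward
```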

Markov decision process: the optimal policy doesn't need to depend on previous states; only the information in the current state counts (Markov property).

![90955f3da8fb0d61c2fa9f3033c65098.png](e78427ef0d0845d0ae21e1c7857d2740.png)

Dealing with sparse loss (sparse rewards):
- start with imitation learning - supervised learning, copying human actions
- reward shaping - guessing a reward for intermediate states, or for states close to good states (see the sketch after this list)
- auxiliary goals - e.g. curiosity, maximum distance traveled
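
A toy illustration of reward shaping, under assumptions of my own (states are positions on a line, the goal is a position, and the shaping bonus is a small distance-based term); none of these names come from the lecture:

```python
def shaped_reward(state, goal, reached_goal):
    true_reward = 1.0 if reached_goal else 0.0   # the real reward is sparse
    distance = abs(goal - state)                 # assumes states are positions on a line
    shaping_bonus = -0.01 * distance             # guessed reward: closer to the goal is better
    return true_reward + shaping_bonus
```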

Policy network: NN that takes the state as input and outputs an action, with a softmax output layer to produce a probability distribution over actions.
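
A minimal sketch of such a policy network in PyTorch (the lecture doesn't prescribe a framework; the state dimension, hidden size, and number of actions are made-up examples):

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2                    # hypothetical sizes

policy_net = nn.Sequential(
    nn.Linear(state_dim, 32),                  # input: state vector
    nn.ReLU(),
    nn.Linear(32, n_actions),                  # one score per action
    nn.Softmax(dim=-1),                        # softmax output layer -> probability distribution
)

state = torch.randn(state_dim)                         # dummy state
action_probs = policy_net(state)                       # distribution over actions
action = torch.multinomial(action_probs, 1).item()     # sample an action from it
```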

Three problems of RL:
- non-differentiable loss
- balancing exploration and exploitation
    - this is a classic trade-off in online learning
    - for example, an agent in a maze may learn to reach a reward of 1 that's close by and keep exploiting that reward, and so it might never explore further and reach the reward of 100 at the end of the maze (a simple way to keep exploring is sketched after this list)
- delayed reward/sparse loss
    - you might take an action that causes a negative result, but the result won't show up until some time later
    - for example, if you start studying before an exam, that's a good thing.
      the issue is that you started one day before, and didn't do jack shit during the preceding two weeks.
    - credit assignment problem: how do you know which action gets the credit (or blame) for the result?
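
One standard way to balance exploration and exploitation is epsilon-greedy action selection; this is my illustration, not a method the lecture commits to, and `probs` is assumed to be the policy's action probabilities as a 1-D numpy array:

```python
import numpy as np

def epsilon_greedy(probs, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(probs))   # explore: pick a random action
    return int(np.argmax(probs))               # exploit: pick the currently preferred action
```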

Deterministic policy - every state is followed by the same action.
Probabilistic policy - all actions are possible, but certain actions have a higher probability.
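
As a sketch, the two policy types differ only in how an action is drawn from the policy's output distribution (`probs` is assumed to be a 1-D numpy array of action probabilities, e.g. the softmax output from above):

```python
import numpy as np

def deterministic_action(probs):
    return int(np.argmax(probs))                        # same state -> always the same action

def probabilistic_action(probs):
    return int(np.random.choice(len(probs), p=probs))   # any action possible, weighted by probability
```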

## Approaches
How do you choose the weights (how do you learn)?
Simple backpropagation doesn't work - we don't have labeled examples telling us which move to take for a given state.

### Random search
Pick a random point m in model space.

```python
import numpy as np

def random_search(loss, m, r=0.1, steps=1000):
    # hill-climbing in model space; assumes m is a numpy parameter vector
    for _ in range(steps):
        step = np.random.randn(*m.shape)
        m_new = m + r * step / np.linalg.norm(step)   # random point m' at distance r from m
        if loss(m_new) < loss(m):                     # keep m' only if it lowers the loss
            m = m_new
    return m
```

"Close to" means sampled uniformly among all points at some pre-chosen distance r from m.
### Policy gradient
Follow some semi-random policy, and wait until a reward state is reached; then label all preceding state-action pairs with the final outcome.
I.e. if some actions were bad, they will on average occur more often in sequences ending with a negative reward, and so on average they will be labeled as bad more often.

![442f7f9bc5e14ffbbcfd54f6ea6b72df.png](c484829362004f90be2b33a92acf7fd9.png)

$\nabla \mathbb{E}_a [r(a)] = \nabla \sum_{a} p(a) r(a) = \sum_{a} p(a) r(a) \nabla \ln p(a) = \mathbb{E}_{a} [r(a) \nabla \ln p(a)]$, where $r$ is the ultimate reward at the end of the trajectory (the middle step uses $\nabla p(a) = p(a) \nabla \ln p(a)$).
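
A numerical sanity check of this identity for a toy softmax policy over three actions with a fixed reward per action (all values here are made up, not from the lecture):

```python
import numpy as np

theta = np.array([0.5, -0.2, 0.1])   # policy parameters
r = np.array([1.0, 0.0, 2.0])        # reward for each action

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p = softmax(theta)                   # p(a)

# Exact gradient of E_a[r(a)] w.r.t. theta for a softmax policy: p_j * (r_j - E[r]).
exact = p * (r - p @ r)

# Sample estimate of E_a[r(a) * grad log p(a)], with grad_theta log p(a) = onehot(a) - p.
samples = np.random.choice(len(p), size=100_000, p=p)
grads = r[samples, None] * (np.eye(len(p))[samples] - p)
estimate = grads.mean(axis=0)

print(exact, estimate)               # the two should be close
```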

### Q-learning
If I need this, I'll make better notes; I can't really understand it from the slides.

## Alpha-stuff
### AlphaGo
Starts with imitation learning.
Improves by playing against previous iterations and against itself; trained by reinforcement learning, using the policy gradient method to update the weights.
During play, it uses Monte Carlo Tree Search, with node values being the probability that black will win from that state.

### AlphaZero
Learns from scratch: there's no imitation learning or reward shaping.
Also applicable to other games, like chess.

Improves on AlphaGo by:
- combining the policy and value nets
- viewing MCTS as a policy improvement operator
- adding residual connections and batch normalization

### AlphaStar
This shit can play StarCraft.

Real time, imperfect information, a large and diverse action space, and no single best strategy.
Its behaviour is generated by a deep NN that takes input from the game interface and outputs instructions that form an action in the game.

It has a transformer torso for the units, a deep LSTM core with an autoregressive policy head, and a pointer network.
It makes use of multi-agent learning.